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Abstract 

The design of new products for consumer markets has undergone a major transformation over the 
last 50 years. Traditionally, inventors would create a new product that they thought might address a 
perceived need of consumers. Such products tended to be developed to meet the inventors own percep- 
tion and not necessarily that of consumers. The social consequence of a top-down approach to product 
development has been a large failure rate in new product introduction. By surveying potential cus- 
tomers, a refined target is created that guides developers and reduces the failure rate. Today, however, 
the proliferation of products and the emergence of consumer choice has resulted in the identification of 
segments within the market. Understanding your target market typically involves conducting a product 
category assessment, where 12 to 30 commercial products are tested with consumers to create a prefer- 
ence map. Every consumer gets to test every product in a complete-block design; however, many classes 
of products do not lend themselves to such approaches because only a few samples can be evaluated 
before 'fatigue' sets in. We consider an analysis of incomplete balanced-incomplete-block data on 12 
different types of white bread. A latent Gaussian mixture model is used for this analysis, with a partial 
expectation-maximization (PEM) algorithm developed for parameter estimation. This PEM algorithm 
circumvents the need for a traditional E-step, by performing a partial E-step that reduces the Kullback- 
Leibler divergence between the conditional distribution of the missing data and the distribution of the 
missing data given the observed data. The results of the white bread analysis are discussed and some 
mathematical details are given in an appendix. 

Keywords: Balanced-incomplete-block; mixture models; progressive EM; sensometrics; white bread. 

1 Introduction 

Consumer-driven product development of new consumer products and the improvement of existing products 
have become recognized as a best-practice in industry. Food researchers have becom e increasingly depen dent 



on understanding consumer wants and desires to effectively design food products (| Jaeger et all |2003) . To 
understand consumer behaviour, preference maps are built by assessing consumer liking of an appropriate 
range of commercial products within a category. From these liking data, a model may be built that describes 
the ideal product for the test population. However, most product categories will have more than a single 
ideal product, with two or more liking clusters revealed. Hed onic taste tasting is the most common practice 
used to measure consumer liking within a target population ( Lawless and Hevmann . 201fj| ). In a complete- 



block design, every consumer gets to taste every product, but many product categories do not facilitate this 
sampling plan. When tasting wine, for example, a consumer can only evaluate three or four samples before 
'fatigue' sets in, compromising the quality of the data collected. The fact that co nsumers tend to behave like 
experts puts into doubt the value of data obtained over multiple tasting sessions ( Findlav . 20081) . Therefore, 



a balanced-incomplete-block (BIB) design is used for high-fatigue products. The resulting data are sparse 
ana! tend to be heterogenous; therefore, we must identify sub-populations in an incomplete-data setting. 
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In this paper, we consider an analysis of 12 white breads. The descriptive analysis of the breads provides 
a measure of the range of sensory properties found within the product category. It also gives us information 
regarding changes that may improve the sensory liking of a product for a specific consumer segment. Con- 
sumer research is conducted to measure liking on a nine-point hedonic scale anchored at dislike extremely 
(1) and like extremely (9), with the midpoint (5) indicating neither like nor dislike. By clustering consumers 
on similarity of liking profiles across the products, it is possible to determine the sensory attributes that 
contribute to like/dislike within each cluster. The calibrated descriptive analysis was performed by a trained 
panel using well-defined product attributes that have been rated for intensity on a scale from to 100. The 
attributes are generated to provide a complete sensory description of the breads. Some attributes are found 
at low intensity, but are important in differentiating products. There are also attributes that are defined 
by a major attribute or group of attributes. For example, sourdough breads would score high in sourness 
and sour aroma and flavour. The products selected for this study encompass the commercial sliced white 
bread category. The range goes from the extremely popular sandwich breads that are fine-celled, spongy, 
and bland to a ciabatta-style Italian hearth bread. The breads differ in crust colour and roughness, texture 
of the crumb, and flavour, but all fall within the realm of sliced white bread. A total of 369 consumers 
evaluated six breads within a 12-present-6 BIB design. 

One straightforward way to tackle such an incomplete-data problem is to impute the missing data prior to 
the analysis. However, this approach is not generally desirable for clustering problems because the imputed 
values will be partly based on data from other sub-populations. Herein, we develop a clustering approach 
for these data based on a finite Gaussian mixture model. A random variable X follows a G-component finite 
Gaussian mixture model if its density can be written 



G 



/(x|t?) = > ^(x| Mg ,S 9 ), 



(1) 



where tt 9 > 0, with X} ff =i 7r 9 = 1' are mixing proportions, </>(x | fi g ,'. 



is multivariate Gaussian den- 
, fe , Si Sg). Finite mix- 



sity with mean fj, g and covariance matrix S s , and i9 = 
ture models have been used for clustering for at least fifty years (Wolfe, 1963) and such applications are 
commonly referred to as 'model-based clustering' (cf. iFralev and Raftervl . 120021 ). One problem with ap- 



plications of Guassian mixture models is the number of free covariance parameters: Gp(p + l)/2. To 
overcome this, many authors have consid ered imposing constraints on decomposed component covariance 
matrices (e.g., Celeux anc Gov aertl . 119951 ) and other have considered underlying latent factor models (e.g. 



Ghahramani and Hintonf . 



19971 ) . We consider an underlying latent factor model herein (cf. Section [3]) and 



because so many of the data are missing, we assume common component covariance. The result is a parsi- 
monious Gaussian mixture model. 

The expectation- maximization (EM) algorith m (Dempster et all 1977) is th e standard approach to pa- 
rameter estimation for model- based clustering (cf. lMcLachlan and Basfor 

iliaaal). 

However, in our incomplete- 
block data we obtain only 6 liking scores from 12 products for each consumer; therefore, an EM algorithm 
would require (g 2 ) = 924 different 6x6 matrix inversions for each mixture component in each E-step. To 
circumvent this problem, we develop a 'partial' EM (PEM) algorithm that requires only a single 12 x 12 
matrix inversion for each mixture component. We show that this PEM retains the monotonicity property 
and thus all of the convergence properties of the standard EM algorithm, but is much more computationally 
efficient than the standard EM algorithm for this particular problem. 

The remainder of this paper is laid out as follows. In Section [21 we review the application of the EM 
algorithm for missing data. Then, our parsimonious Guassian mixture model is presented and the PEM 
algorithm is developed (Section [3J . We apply our method to the white bread data in Section [4j where we 
also compare our PEM algorithm to the standard EM algorithm (Section I4.3[) . The paper concludes with 
discussion and suggestions for future work (Section [5]). 
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2 The EM Algorithm for Missing Data Problems 



The EM algorithm is an iterative procedure for finding maximum likelihood estimates when data are in- 
complete. Therefore, EM algorithms are naturally suited for missing data problems. The EM algorithm 
consists of alternating between E- and M-steps until a convergence criterion is satisfied. In the E-step, the 
expected value of the complete-data (i.e., the observed plus missing data) is computed, and in the M-step, 
this quantity is maximized with respect to the parameters. Formally, the EM a l gorithm is a special case of 
an MM algorithm of the minorization-maximization variety (|Hunter and Lange , 200(1 l2004h . 

Suppose we observe p-dimcnsional yi , . . . , y n such that each can be decomposed into an observed 
component, Xj, of dimension m, a missing component, z,-, of dimension I = p — m, and 



M 2 



^•xz 
^•zz 



Then, the conditional distribution of the missing data given the observed is given by 

Zj|Xj = Xj ^ ■^'(fJ'Zi.Xi := A*z + ^zx^xx ( x i — AO ' ^z.x '■= ^zz — S Z iSjS JZ ). 

Now, define 



Yv := 



0/x; 

O/xm 



O/xm 



■= Vz.Xi = Vz 

o /x; 



^zx^ X x (Xj — H x ) 



O/xm 



Olxl 
Omxi 



O/xm 

^zx^xx ^xz 



(2) 
(3) 



where 0; xm is an I x m matrix of zeros and y^ = (x^, z^). Using this notation, the EM updates for the mean 
and covariance can be written as 



1 

y = - 



and s (t+1) = s = J2 (Si - y) (Si - y)' + E Y - 



(4) 



respectively. Because our white bread data have lots of missing observations, 'standard' E-steps are very 
computationally expensive. For example, each observation requires inversion of a 6 x 6 matrix; this amounts 



to (g 2 ) = 920 different matrix inversions at each iteration. 



Next, consider approximate E-steps instead of full E-steps. All of these procedures will work with the 
inverse of X and its principal sub-matrices and vectors. We will assume that S _1 is known; this quantity 
is typically readily available because it is necessary to calculate the log-likelihood at each EM iteration. We 
denote 



S = 



and 



5T 1 = S = 



•'xx 
-'zx 



•'xz 
-'zz 



because there exists a relationship between a zz and S 2 .x- This relationship exists because S z . x is the Suhur 
complement of the matrix S which has the property that 



1 — 'zz 



and, equivalently, 



In addition, there exists a relationship between the regression coefficients and H, which can be derived 
through block inversion of S, 

y y- 1 - _n-in 

^zx^xx — "zz "zx- 

These relations can be useful if the dimension of z is smaller than the dimension of x. For example, if there 
is a single missing observation then using S requires a p — 1 matrix inversion, whereas if B is known we only 
require an inversion of a 1 x 1 matrix. 
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For the extension G > 1 mixture components, we require the weighted versions of @. The weight for 
observation i in component g is 

_ Trg<l>{xi\Vg,x,Eg,xx) 



where TTg t+1 ' = n g /n = (1/n) E"=i = y 3 = (1/%) Yh=i w igfu and 



9 Y,k=l 7r k(/>(xi\H k , x ,'Zk,xx) 

(t+i) _ = _ 



(5) 



1 " 

s s = — Wi f (y< ~ y) (y* ~ y)' + ^ 



(6) 



where n g = £" =1 ' 



Hg- 



3 Methodology 



3.1 Finite Mixture Models with Common Factors 

As already mentioned (Section [T]), it is common practice to introduce parsimony via constrained component 
covariance matrices. When dealing with sparse data, such as the white bread data, estimating the covariance 
matrix can be especially difficu l t. Therefore, we use a var iant of the mixture of factor analyzers model 
( Ghahramani and Hintonl . 1997 : McLachlan and PeelL 200*0}) in which we cons train the component factor 
loading ma t rices t o be equal acro ss groups (cf. iMcNicholas and Murphy . 20081 ). The factor analysis model 
( Spearmanl . 19041 Bartlett . 19531 ) assumes that a p-dimensional random vector Xj can be modelled using 
a g-dimensional vector of latent factors Uj, where q <C p. The model can be written Xj = \x + AUj + e, 

- N(0, *), where * 
*). It follows that 



where A is a p x q matrix of factor weights, the latent variables Uj ~ A/"(0, I g ), and e 
is a p x p diagonal matrix. Therefore, the marginal distribution of Xj is Af(fi, AA' H 



the density for the mixture of factor analyzers model is that of Equation Q] with S 5 = A g A g + *i> g . The 
model we use for the analysis of the bread data assumes equal factor loading matrices across components, 
i.e., S s = A A' + \I/ 9 , and so the EM algorithm updates are given by 



/ i (new)\ 

( ) 



-(new) 

* 9 = diag 



9=1 



n 



- (new) - „ 



(new) 




(new' 



(7) 



(8) 



3.2 PEM Algorithm 

We follow Neal and Hintonl (|l998l ) and store the sufficient statistics (zi, . . . ,z„) and (Zi, . . . , Z n ). However 
i nstea d of doing a complete update of a partial set of the sufficient statistics as suggested bv lNeal and Hinton 
(jl998l ). we perform a partial E-step at each iteration. These partial E-steps can be shown to reduce the 
Kullback-Leibler ( KL) divergence at every step, which ensures that the monotonicity of the EM algorithm 
is preserved. From iNeal and Hintonl (1998), the EM algorithm can be viewed as minimizing the function 



J2 F ( p n0) = -J2D(p i \\p i}0 ) + J2i(x i \e), 



i=l 



where D(Pi\\Pi^) is the KL divergence between the distribution of the missing data Pi and the conditioal 
distribution of the missing data given the observed data, Pq = P(Zj |xj,6*). A 'standard' E-step sets Pj 
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to P(Zi\xi,&t), for all i, at each iteration t. Neal and Hinton ( 19981) suggest a partial or sparse E-step, 
where a subset of Pi is updated to P(Zj|xj, 6* t ) at each EM iteration. The algorithm we describe in the next 
two sections partially updates each Pi towards P(Zj|xj,#) such that the KL divergence is reduced but not 
minimized. For the multivariate Gaussian distribution and a particular i, the EM algorithm can be viewed 
as minimizing 

F{N z ,*i,9) = -D KL (N Z \\N Z .^) + l(xi\0), 

with respect to N z , the distribution of the latent or missing variables, and the parameter set 9. 

The KL divergence between the missing data distribution with mean Zj and variance Zj and the condi- 
tional distribution of the missing data given the observed data Xj is 



D Kh (N z \\N z . Xi ) = l 



fr{E-iz i } + (i i -^,. x )'E-i(a< 




(9) 



From Equation GO we can see that if we set Zj to the conditional mean and Zj to the conditional covariance 
matrix, then the KL divergence is minimized. However, for our data this involves inverting ( 6 ) = 924 
different 6x6 matrices at each iteration. Finding the minimum distribution for the missing data in each 
row is computationally expensive, so we will instead iteratively minimize the KL divergence on simpler 
computations. 



3.3 Notation 

Hereafter, the following notation will be used. Let Ej be the principal sub- matrix of E, obtained by deleting 
column j and row j. Let Oj and £j be the jth diagonal elements of E and H, respectively. Let crj and £j be 
the jth rows of S and S, respectively, with the jth element deleted. For example, if j = 1, then 



E = 



0i 



Si 



and 



S- 1 = H = 



6 Ci 
ix Si 



(10) 



3.4 Minimizing the KL Divergence With Respect to % 

To minimize the KL divergence with respect to Zj, we set it to fi z x , which is given in @. However, fi z x 
depends on a matrix inversion that depends on i. We now develop an updating equation that reduces the 
number of matrix inversions. The KL divergence depends on Zj through one term in equation (j9|) and 

argmax £> KL (N z \\N z . Xi ) = argmax (y 4 - fi)' (y 4 - /Lt) 



because the last term in 



(y, - n)' E- 1 (y 4 - M ) = (z, - fi z .jS^{z t 



is fixed and, moreover, does not depend on Zj (cf. Appendix IA.1[) . 
We now consider a conditional minimization, by each element in y 



(yn, ■ ■ -,yin), of 



argmax (y, - fi)' E 1 (y 4 - /x) , 



which is given by 



(11) 



(t) 



Vij 



A 1 — fc J if j is corresponds to an element in z, 
if j is corresponds to an element in x. 
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where /ifc is the fcth element of /x. This procedure requires inversion of the Sj, i.e., the jth principal sub- 



matrix; which involves a p — I matrix inversion. However, under the assumption that S 1 = S is known 
the updates can be simplified to 



"' — J — [y-l ~ P—j) ^ 3 * s corres P on ds to an element in %, 
Dij if j is corresponds to an element in Xj. 



This set of updates requires the inversion of a 1 x 1 matrix, which is trivial. These updates are guaranteed to 
converge to the global minimum because the objective function is convex. The advantage of this conditional 
minimization of the KL divergence is that the update is computationally simple. If wc let the n x p matrix 
A = (y' 1; . . . ,y' n ), then our partial E-steps update A by column instead of 'standard' E-steps that update 
A by row. 

3.5 Minimizing the KL Divergence With Respect to Zj 

Now, minimizing the KL divergence with respect to Zj we have 



argmax L» KL {N z \\N z . Xz ) = argmax tr |s z *%\ 



Zj Zj 




The l og-determinant and the trace function are convex with respect to the positive definite matrices (Ma gnus and Neudecker 



1998). Now, consider the function 

7 (Z0 = tr[(E-Yi) IT 1 (S-Yi)] , (12) 
which is also convex, and these functions have the property 

argmax D KL {N z \\N z . Xi ) = argmax tr [(s - %) E" 1 (s - %)] . (13) 



Both objective functions are minimized by the Schur complements (c.f. Appendix IA.2I for the function 7). 
In addition, if one objective function is reduced then so is the other because both functions are convex and 
have the same (global) minimum. Therefore, if the function 7 is reduced at every iteration, then the KL 
divergence is reduced at every iteration and so the algorithm has the monotonicity property. 

Conditional updates for the function 7 are derived in Appendix I A. 21 We update Yj by column and row, 
while holding the associated principal sub-matrix fixed. Therefore, if we denote the jth row of Yj by Yjj 
and the {p— 1) x p matrix obtained from removing the jth row as Y^—j-, then the updates can be written as 

y(*+ x ) _ f o-j + (o-j - Y^*n SJ 1 (S-j - Y-*!^ if j is associated with z*, 
( Oi X p if j is associated with Xj, 

and then we set the jth column of Y,[ t+1 - ) equal to the jth row. 

We can avoid the matrix inversion of the principal sub-matrix Sj by again exploiting the properties of 
the inverse of S. Specifically, if ^er,; ^ 1, then 



S7 1 



t , 1 t ' 



j j - 



where is the (p — 1) x [p — 1) identity matrix. 



G 



3.6 Evaluating Weights and the Likelihood Function 

The likelihood depends only on the observed data 

/(Xl) = (2^|£^|V2 exp H (Xl " fO'S-te ^ 14 ) 
However, from Equation (|11[) we have 

(fi - A*)' (y* - A*) = (z» - M z .x)' s ^x(zi - M 2 .x) + (xi - /iJ'Snlx, - fi x ). (15) 

Therefore, 

(xj - /zj's^Xj - /*j < (y< - m)' (y» - m) = (y< - a*)' h (y* - a*) ( 16 ) 

and we have equality when Zj — \x z x . Our algorithm for Zj will converge to fi z x , so if we use this approxi- 
mation for the likelihood and weights calculations they will also converge to the true quantities. 

To calculate |S ax |) we use the relationship between Schur complements and the determinant, and the 
relationship between the inverse matrix and the Schur complement; 

In |S| = ]n|£ OT | + ln|X3,. fl .| = In \Z XX \- In |S„|. 

Alternatively, we could use our current estimate of 'S z _ Xi , namely, Z^. However, note that if the dimension 
of the missing data is larger than the dimension of the observed data for observation i then it will better to 
calculate \H X x\ directly. 



3.7 Model Selection 

The Bayesian information criterion (BIC; ISchwarz , 19781) is used to select the number of components G and 
the number of latent factors q. For a model with parameters 0, BIC = 2Z(x, 9) —m log n, where Z(x, 6) is the 
maximized log-likelihood, is the maximum likelihood estimate of 6, m is the number of free parameters in 
the model, and n is the numbe r of observations. The use of the BIC in mixtur e model selection was orig inally 
(jDasgupta and Raftervl 119981 ) based on an approximation to Bayes factors (jKass and Raftervl . 119951 ). The 
eff ectiveness of the BIC fo r choosing the number of factors in a factor analysis model has been established 
bv lLopes and Westl (|2004h . 



4 Analysis of the White Bread Data 

4.1 The White Bread Data 

A total of n = 369 consumers tasted six out of 12 white breads in a BIB design. Taste was evaluated on the 
hedonic scale, so values in {1, 2, . . . , 9} are assigned to each tasted bread. For illustration, the first few rows 
of the data are shown in Table [IJ where the bread brands are denoted A, B, . . . , L. We fitted our mixture 
of factor analyzers model, with common factors, to these data using the PEM algorithm introduced herein. 
These models were fitted for G = 1, . . . , 6 and q = 1, . . . , 3, using multiple restarts. 

4.2 Results 

The results (Table [2J show that the BIC selected a model with G = 3 components and q — 2 factors. Note 
that we also ran standard EM algorithms on these data and can confirm that they converged to the same 
results as our PEM algorithms. A plot of the two latent factors (Figure Q]) shows the three components in 
the latent space. Because the classifications are based on maximum a posteriori (MAP) probabilities, it is 
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Table 1: The first six rows of the white bread data, where each consumer evaluates six breads using the 
hedonic scale. 



Consumer 


A 


B 


C 


D 


E 


F 


G 


H 


I 


J 


K 


L 


1 


9 




8 


6 








9 






4 


8 


2 


3 




8 




7 




8 


7 


8 








3 




8 


6 


7 










6 


9 


7 




4 






5 


4 




G 




4 


3 


6 






5 






7 


7 






8 


7 


6 




8 




6 








8 






3 


4 


8 




7 


7 



Table 2: BIC values from our 


analysis 


of the white bread data, for G = 1, . . . , 6 components and q 


= 1,...,3 


latent factors. 








Number of Latent Factors 


G 


1 


2 


3 


1 


5273.9 


5318.0 


5369.9 


2 


5176.1 


5136.0 


5193.1 


3 


5148.1 


5125.5 


5244.1 


4 


5182.2 


5171.5 


5285.0 


5 


5223.1 


5288.1 


5341.7 


6 


5374.1 


5439.1 


5492.7 




Figure 1: Plot of the two latent factors for the selected model, coloured by component (left), and a plot of 
the average liking scores for each of the breads separated by component (right). 



straightforward to provide the client with probabilities rather than hard group memberships; this might be 
particularly desirable for consumers near the cluster boundaries. 

In the same figure, there is also a plot of the mean liking scores for each bread for each of the three 
components. The red and green components seem to represent higher and lower scorers, respectively, with 
consumers within the black component exhibiting more variability in liking. 

Some interesting points emerge from inspection of the results. Notably, bread J emerges as polarizing: 
it is strongly liked in the red and black groups and disliked in the green group. Interestingly, bread J is 
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the only ciabatta-style bread in the study and so it makes sense that its sensory properties will result in 
a relatively extreme liking response. This liking contrast is useful in differentiating groups of consumers 
because the objective of this research is to understand the sensory-based choice behaviour of consumers to 
define an optimum product for each liking cluster. Bread I is also interesting, in that it is the one bread for 
which consumers in all three groups seem to converge to the similar liking scores. Bread I is the sweetest, 
most flavourful bread in the study; it is also firm, dense, moist, and chewy. This is an unusual combination 
of characteristics and one would expect it to stand out. The fact that it stood out by not differentiating 
consumers in this study is itself interesting in the process of trying to understand the sensory-based choice 
behaviour of consumers. 

4.3 Comparing PEM and EM 

The analysis of the bread data was repeated using a standard EM algorithm for parameter estimation. 
The results were the same, as we would expect. Figure [2] illustrates the progression of the EM and PEM 
algorithms with G = 1 and G = 2 components, respectively, and q = 2 latent factors. As expected, both 
algorithms converge to the same solution in an almost identical fashion. 




Figure 2: Plot of the log-likelihood for G — 1 and G — 2 for PEM (red line) and EM (black line) algorithms. 



5 Discussion 

We developed an approach for clustering incomplete BIB data from consumer tasting of 12 different com- 
mercial white breads. Our clustering approach is based on a parsimonious mixture of factor analyzers model, 
where the factor loading matrices are constrained to be equal across groups. The problem of missing data 
is handled along with parameter estimation within an partial EM algorithm framework. Rather than sim- 
ple imputation, this PEM algorithm approach effectively imputes missing data at each iteration based on 
current component membership probabilities; this is a natural approach as missing values are filled in based 
on complete values in observations that are in some sense similar (i.e., in the same component). Our PEM 
algorithm is much more computationally efficient than the standard EM algorithm for this, and any such, 
missing data problem. The PEM is shown to retain the monotonicity property and, thus, retains the same 
convergence properties as the EM algorithm. Three benefits are achieved through this approach: the quality 
of data that are collected prior to fatigue is improved; the method of substituting missing data reflects 
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the sensory preferences of each consumer, which permits robust cluster assignment; and the collection of 
incomplete-block data reduces the cost, time, and materials required for this type of study. 

We introduce a new variation of the EM algorithm called the PEM algorithm. The many varieties of the 
EM m ainly focus on the M-step: the expect ation-conditional max imization (ECM) algorithm (jMeng and Rubin 
19931 ). the ECM either (ECME) algorithm (|Liu and Rubinl . ll994f ). the alte rnating ECM (AECM) a lgorithm, 
and others. Few examine different ways of partially updating the E-step. iNeal and Hintonl (099 8) give sev- 
eral possible methods to update the missing sufficient statistics. All of the methods suggest something along 
the lines of fully upda ting a partial set of the missing sufficient statistics. When the E-step is intractable, 
Wei and Tanner (1990) suggest approximating the E-step by simulating m observations from the conditional 
distribution of the missing data given the observed data. T his version of EM algorithm is called Monte Carlo 
EM (MCEM). Prior to MCEM. ICeieux and Diebold (|l985l ) suggested using stochastic EM (SEM), which is 
the same as MCEM with m = 1. Other variations on approximating the E-step have been introduced, 
such as MCEM using rejection sampling, importance sampling, and Markov chain Monte Carlo. The PEM 
algorithm presented here is similar to ECM in which we have 'conditional' E-steps instead of conditional 
M-steps. These conditional E-steps are computationally cheaper than using a complete or full E-step. 
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A Some Mathematical Details 

A.l Schur Complement Relation in Quadratic Form 

Suppose we have a positive-definite symmetric matrix S and a vector y with decompositions 



yi 



and 



Sn S12 
S21 S22 



then 



y'S-V = yiS^yi + (y 2 - SaiS^O'S^ya - S 2 iS n Vi) 
where S22.1 = S22 - S 2 iS n 1 Si 2 . 



A. 2 A Matrix Minimization Problem 

Suppose we have a positive-definite symmetric matrix S with decomposition 



S = 



Sn S12 
S21 S22 



We then have the following property for a function 7, 

h(& n ) =tr{(Sn -0n,Si 2 )S- 1 (Sii -0u,Si 2 )} > tr {S^S^S'^} 



(17) 



Equality holds when ©n = Sn — S12S12S21. Therefore, h(&n) is minimized by the Schur complement 
©n = Sn — S12S12S21. Now, if wc define 



= 



©ii 




(18) 
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then 

^(©n) =tr{(S-0)S- 1 (S-©)} = 7 (0 11 ) + tr{S 2 - 2 1 S 22 S 22 }. (19) 

Because the right-hand term does not depend on 0n, fi(8n) has the same minimum as j(0n); i.e., the 
Schur complement ©n = Sn — Si 2 Si 2 S 2 i. Therefore, a minimization algorithm based on the function h 
is equivalent to minimizing 7. We minimize h using a conditional minimization algorithm (by column/row) 
based on Equation (|T7|) . 
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